2025-07-04 21:49:03
JSON, or JavaScript Object Notation, is a lightweight data-interchange format. It is widely used for transmitting data between a server and a web application, due to its simplicity and compatibility with many programming languages.
The JSON format has a simple syntax with a small, fixed set of data types: strings, numbers, Booleans, null, objects, and arrays. Strings must not contain unescaped control characters (e.g., no raw newlines or tabs); instead, special characters must be escaped with a backslash (e.g., \n for newline). Numbers must follow valid formats, such as integers (e.g., 42), floating-point numbers (e.g., 3.14), or scientific notation (e.g., 1e-10). The format is specified formally in RFC 8259.
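For example, the following short document (of my own making) exercises most of the types, including an escaped newline inside a string:
{
  "name": "Jo\nhn",
  "age": 42,
  "ratio": 3.14,
  "tiny": 1e-10,
  "active": true,
  "nickname": null,
  "toys": ["car", "ball"]
}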
Irrespective of your programming language, there are readily available libraries to parse and generate valid JSON. Unfortunately, people who have not paid attention to the specification often write buggy code that produces malformed JSON. Consider strings, for example. The specification states the following:
All Unicode characters may be placed within the
quotation marks, except for the characters that MUST be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).
The rest of the specification explains how characters must be escaped. For example, any linefeed character must be replaced by the two characters ‘\n’.
Simple enough, right? Producing valid JSON is definitely not hard. Programming a function that properly escapes the characters of a string can be done by ChatGPT, and it spans only a handful of lines of code.
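To make this concrete, here is a more explicit sketch in C++ of such an escaping routine (the name json_escape is my own, not from any particular library):
#include <cstdio>
#include <string>

// Minimal sketch: escape a string for inclusion in a JSON document, following
// RFC 8259 (quote, backslash, and control characters U+0000 through U+001F).
std::string json_escape(const std::string& input) {
  std::string out;
  for (unsigned char c : input) {
    switch (c) {
      case '"': out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\b': out += "\\b"; break;
      case '\f': out += "\\f"; break;
      case '\n': out += "\\n"; break;
      case '\r': out += "\\r"; break;
      case '\t': out += "\\t"; break;
      default:
        if (c < 0x20) { // remaining control characters use the \u00XX form
          char buffer[8];
          std::snprintf(buffer, sizeof(buffer), "\\u%04x", c);
          out += buffer;
        } else {
          out += static_cast<char>(c);
        }
    }
  }
  return out;
}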
Sadly, some people insist on using broken JSON generators. It is a recurring problem: they later expect parsers to accept their ill-formed JSON. By breaking interoperability, you lose the core benefit of JSON.
Let me consider a broken JSON document:
{"key": "value\nda"}
My convention is that \n is the one-byte ASCII control character linefeed. This JSON is not valid. What happens when you try to parse it?
Let us try Python:
import json

json_string = '{"key": "value\nda"}'
data = json.loads(json_string)
This program fails with the following error:
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 15 (char 14)
So the malformed JSON cannot be easily processed by Python. Not good.
What about JavaScript?
const jsonString = '{"key": "value\nda"}';
let data = JSON.parse(jsonString);
This fails with
SyntaxError: Bad control character in string literal in JSON at position 14 (line 1 column 15)
Not great.
What about Java? The closest thing to a default JSON parser in Java is Jackson. Let us try.
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

void main() throws Exception {
    String jsonString = "{\"key\": \"value\nda\"}";
    // parse into a generic map; this throws on the unescaped control character
    Map<String, Object> data = new ObjectMapper().readValue(jsonString, Map.class);
}
I get
JSON parsing error: Illegal unquoted character ((CTRL-CHAR, code 10)): has to be escaped using backslash to be included in string value
What about C#?
using System.Text.Json;

string jsonString = "{\"key\": \"value\nda\"}";
using JsonDocument doc = JsonDocument.Parse(jsonString);
And you get, once again, an error.
In a very real sense, the malformed JSON document I started with is not JSON. By accommodating buggy systems instead of fixing them, we create workarounds that degrade our ability to work productively.
We have a specific name for this effect: technical debt. Technical debt refers to the accumulation of compromises or suboptimal solutions in software development that prioritize short-term progress but complicate long-term maintenance or evolution of the system. It often arises from choosing quick fixes, such as coding around broken systems instead of fixing them.
To avoid technical debt, systems should simply reject invalid JSON: such documents pollute our data ecosystem. Producing correct JSON is easy. Bug reports should be filed against systems that push broken JSON. It is ok to have bugs; it is not ok to expect the world to accommodate them.
2025-07-03 21:38:07
C and C++ compilers like GCC first take your code and produce assembly, typically a pure ASCII output (so just basic English characters). This assembly code is a low-level representation of the program, using mnemonic instructions specific to the target processor architecture. The compiler then passes this assembly code to an assembler, which translates it into machine code—binary instructions that the processor can execute directly.
When compiling code, characters like ‘é’ in strings, such as unsigned char a[] = "é";, may be represented in UTF-8. The Unicode (UTF-8) encoding for ‘é’ is two bytes, \303\251. However, when this is represented as an assembly string, it requires 8 characters to express those two bytes (e.g., "\303\251") because the assembly is ASCII. Thus, a single character in source code can expand significantly in the compiled output.
As a related issue, new versions of C and C++ have an ‘#embed’ directive that allows you to directly embed an arbitrary file in your code (e.g., an image). Such data might be encoded inefficiently as assembly.
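Its use looks like this (the file name image.png is hypothetical):
// C23/C++26 #embed: the file's bytes become the array's initializer
unsigned char image[] = {
#embed "image.png"
};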
What could you do?
Base64 is an encoding method that converts binary data into a string of printable ASCII characters, using a set of 64 characters (uppercase and lowercase letters, digits, and symbols like + and /). It is commonly used to represent binary data, such as images or files, in text-based formats like JSON, XML, or emails (MIME).
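For illustration, here is a minimal C++ sketch of the encoding (the helper base64_encode is my own naming, not what GCC uses internally):
#include <cstdint>
#include <string>

// Minimal sketch of base64 encoding: every 3 input bytes become 4 output
// characters drawn from a 64-character alphabet; '=' pads the final group.
std::string base64_encode(const uint8_t* data, size_t len) {
  static const char table[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  size_t i = 0;
  for (; i + 3 <= len; i += 3) {
    uint32_t triple = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
    out += table[(triple >> 18) & 0x3F];
    out += table[(triple >> 12) & 0x3F];
    out += table[(triple >> 6) & 0x3F];
    out += table[triple & 0x3F];
  }
  if (i < len) { // 1 or 2 bytes remain: pad the last group with '='
    uint32_t triple = data[i] << 16;
    if (i + 1 < len) { triple |= data[i + 1] << 8; }
    out += table[(triple >> 18) & 0x3F];
    out += table[(triple >> 12) & 0x3F];
    out += (i + 1 < len) ? table[(triple >> 6) & 0x3F] : '=';
    out += '=';
  }
  return out;
}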
When starting from binary data, base64 expands the data, turning 3 input bytes into 4 ASCII characters. Interestingly, in some cases, base64 can nevertheless be used for compression purposes. Older versions of GCC would compile
unsigned char a[] = "éééééééé";
to
.string "\303\251\303\251\303\251\303\251\303\251\303\251\303\251\303\251"
GCC 15 now supports base64 encoding of data during compilation, with a new “base64” pseudo-op. Our array now gets compiled to the much shorter string
.base64 "w6nDqcOpw6nDqcOpw6nDqQA="
2025-06-27 21:33:12
Back when I started programming, project teams were large. Organizations had dozens of programmers on sprawling projects. Reusing code was not trivial. Sharing code required real effort. Sometimes you had to hand over a disk or recopy code from a printout.
Better networking eased code reuse. Open source took over. We no longer have someone writing a red-black tree from scratch each time we need one.
What happened to those writing red-black trees when you could just download one from the internet?
They moved on to other work.
Generative AI (Copilot and friends) is basically just that: easier code reuse. If you’re trying to solve a problem that 20 engineers have solved and posted online, AI will solve it for you.
What happens to you? You move on to other, better things.
Software is supposed to be R&D. You’re supposed to solve problems no one has quite solved before.
But aren’t people predicting mass unemployment due to AI?
I suggest not making claims about the future without looking at hard data first. The unemployment rate in Canada is relatively high right now, but historical data shows no ChatGPT discontinuity (before 2022 versus after 2022).
Maybe unemployment due to AI will hit computer science graduates hardest? Compared to what? Biology or business graduates? Where’s the data? Whether you should go to college at all is another question, but if you recommend opting out of computer science while still going to college, please provide the ChatGPT-safe alternative.
Can anyone become a programmer today? Let’s look at what happens on campus. Over 90% of students use AI for coursework. I estimate the percentage is higher in computer science. Let’s say all computer science students use AI.
A couple of years ago, I added a chatbot to my intro-to-programming class (now using RAG + GPT-4). With over 300 students a year, it’s a good test case.
I have grade data, and there’s no visible before/after ChatGPT effect. Last term, the failure rate was about the same as five years ago.
I don’t have data from other universities, but I haven’t heard anyone complain that students breeze through programming classes. This is despite AI being able to handle most assignments in an intro-to-computing course.
The challenge in an intro-to-programming class was never finding answers. Before ChatGPT, you could find solutions on Google or StackOverflow. Maybe it took longer, but it’s a quantitative difference, not a qualitative one, for elementary problems.
The skill you need to develop early on is reading code and instantly understanding it. If you’re a programmer, you might forget you have this skill. ChatGPT can give you code, but can your brain process it?
If you give Donald Trump ChatGPT, he still won’t code your web app. He’ll get code but won’t understand it.
More broadly, OECD economists predict worker productivity could grow by up to 0.9% per year thanks to AI over the next ten years. I’m not sure I trust economists to predict a decade ahead, but I trust they’ve studied recent trends carefully. And economists will tell you that a 0.9% rise in worker productivity per year is not at all a break from historical trends.
In fact, by historical standards, productivity growth is low. We are not at all in a technological surge like our grandparents who lived through the electrification of their country. Going from no refrigerator to a refrigerator and a TV, that’s a drastic change. Going from Google to ChatGPT is nothing extraordinary.
Can generative AI bring about more drastic changes? Maybe. I hope so. But we don’t have hard evidence right now that it does.
2025-06-22 09:55:37
Herb Sutter just announced that the verdict is in: C++26, the next version of C++, will include compile-time reflection.
Reflection in programming languages means that you have access to the code’s own structure. For example, you can take a class and enumerate its methods. You could receive a class, check whether it contains a method that returns a string, call this method, and get the string. Most programming languages have some form of reflection. Good old Java, for example, has comprehensive reflection support.
However, C++ is getting compile-time reflection. It is an important development.
I announced a few months ago that thanks to joint work with Francisco Geiman Thiesen, the performance-oriented JSON library simdjson would support compile-time reflection as soon as mainstream compilers support it.
This allows you to take your own data structure and convert it to a JSON string without any effort, and at high speed:
kid k{12, "John", {"car", "ball"}};
simdjson::to_json(k); // would generate {"age": 12, "name": "John", "toys": ["car", "ball"]}
And you can also go the other way: given a JSON document, you can get back an instance of your custom type:
kid k = doc.get<kid>();
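For the example above, I assume kid is a plain aggregate along these lines (the field names match the generated JSON):
#include <string>
#include <vector>

// Hypothetical type matching the example: reflection lets simdjson derive
// the JSON keys (age, name, toys) directly from the member names.
struct kid {
  int age;
  std::string name;
  std::vector<std::string> toys;
};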
The code can be highly optimized and it can be thoroughly tested, in the main library. Removing the need for boilerplate code has multiple benefits.
To illustrate the idea further, let me consider the case of object-to-SQL mapping. Suppose you have your own custom type:
struct User {
  int id;
  std::string name;
  double balance;
private:
  int secret; // Ignored in SQL generation
};
You want to insert an instance of this user into your database. You somehow need to convert it to a string such as
INSERT INTO tbl (id, name, balance) VALUES (0, '', 0.000000);
How easy can it be? With compile-time reflection, we can make it both highly efficient and as simple as a single function call:
generate_sql_insert(u, "tbl");
Of course, the heavy lifting still needs to be done. But it only needs to be done once.
What might it look like? First, we want to generate the column string (e.g., id, name, balance).
I do not have access yet to a true C++26 compiler. When C++26 arrives, we will have features such as ‘template for’ which are like ‘for’ loops, but for template metaprogramming. Meanwhile, I use a somewhat obscure ‘expand’ syntax.
Still, the code is reasonable:
template <typename T>
consteval std::string generate_sql_columns() {
  std::string columns;
  bool first = true;
  constexpr auto ctx = std::meta::access_context::current();
  [:expand(std::meta::nonstatic_data_members_of(^^T, ctx)):] >> [&]<auto member> {
    using member_type = typename [:type_of(member):];
    if (!first) { columns += ", "; }
    first = false;
    // Get member name
    auto name = std::meta::identifier_of(member);
    columns += name;
  };
  return columns;
}
This function is ‘consteval’, which means that it must be evaluated at compile time. So it is very efficient: the string is computed while you are compiling your code. Thus the following function might just return a precomputed string:
std::string g() { return generate_sql_columns<User>(); }
Next we need to compute the string for the values (e.g., (0, '', 0.000000)). That’s a bit trickier. You need to escape strings and handle different value types. Here is a decent sketch:
template <typename T>
constexpr std::string generate_sql_values(const T& obj) {
  std::string values;
  bool first = true;
  constexpr auto ctx = std::meta::access_context::current();
  [:expand(std::meta::nonstatic_data_members_of(^^T, ctx)):] >> [&]<auto member> {
    using member_type = typename [:type_of(member):];
    if (!first) { values += ", "; }
    first = false;
    // Get member value
    auto value = obj.[:member:];
    // Format value based on type
    if constexpr (std::same_as<member_type, std::string>) {
      // Escape single quotes in strings
      std::string escaped = value;
      size_t pos = 0;
      while ((pos = escaped.find('\'', pos)) != std::string::npos) {
        escaped.replace(pos, 1, "''");
        pos += 2;
      }
      values += "'" + escaped + "'";
    } else if constexpr (std::is_arithmetic_v<member_type>) {
      values += std::to_string(value);
    }
  };
  return values;
}
You can now put it all together:
template <typename T>
constexpr std::string generate_sql_insert(const T& obj, const std::string& table_name) {
  std::string columns = generate_sql_columns<T>();
  std::string values = generate_sql_values(obj);
  return "INSERT INTO " + table_name + " (" + columns + ") VALUES (" + values + ");";
}
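A hypothetical use, relying on the definitions above, might look like this:
// Hypothetical usage: build an INSERT statement from a User instance.
User u{}; // value-initialized: id = 0, name = "", balance = 0.0
std::string stmt = generate_sql_insert(u, "tbl");
// stmt is "INSERT INTO tbl (id, name, balance) VALUES (0, '', 0.000000);"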
It is just one of many applications. The important idea is that you can craft highly optimized and very safe code, that will get reused in many instances. The code looks a bit scary, as C++ tends to, but it is quite reasonable.
In the coming years, many projects will be simplified and optimized thanks to compile-time reflection.
Code: I have posted a complete implementation in the code repository of my blog. I am sure it can be significantly improved.
2025-06-21 08:32:39
In North America, my home province of Quebec has a slightly higher life expectancy than the rest of the country. It is also a poorer-than-average province, so that is perhaps surprising. For 2023, the life expectancy at birth in Ontario was 82.33 years, whereas it was 82.55 years in Quebec.
However, if you drill down into the data, you find that the difference has to do with men: men in Quebec live on average to 80.77 years, the highest among Canadian provinces. Women in Ontario and British Columbia live longer than women in Quebec.
What could explain the longevity of Quebec men?
People in Quebec smoke more than the Canadian average. It is probably not why Quebec men live longer.
Alcohol consumption patterns are worth examining. Unfortunately, data on alcohol use by sex for Canadian provinces were unavailable. Focusing on overall alcohol consumption, Quebec stands out with significantly higher levels. The correlation between total alcohol consumption (across men and women) and male longevity is positive but modest, with an R-squared value of 0.27.
One hypothesis suggests that variations in obesity rates may be a key factor. Quebec differs from other Canadian provinces in several ways. Notably, men in British Columbia and Quebec tend to have lower body mass indices compared to those in other regions. The correlation between obesity and longevity is more obvious (R-squared of 0.59).
In conclusion, I do not know why men in Quebec tend to live a bit longer than men in other provinces, but they tend to drink a lot more alcohol and they are leaner.
2025-06-16 22:42:38
Prescientific and preindustrial thought tied truth to authority and tradition. “We’ve always done it this way.” “The king decrees it.” “We know this is how it’s done.”
The scientific and industrial revolutions shattered this mindset. Suddenly, outsiders could challenge entrenched norms. Two brothers in a rundown workshop could build empires to rival the wealthiest lords. Ordinary people could question the highest authorities.
What made this possible? A belief in objective truth. Reality exists independently of our perceptions. For Christians, this meant God created a discoverable world, and humanity’s role was to uncover its workings and harness them.
How could an individual stand against tradition or authority? Through facts. A new method is cheaper—or it isn’t. It’s faster—or it isn’t. It yields better results—or it doesn’t.
This is the essence of progress: you can improve, and you can prove it.
The British Empire dominated the seas, waged wars, and abolished slavery not because of superior manpower or land, but through better governance, business practices, and science.
Not every question is empirical, but the critical ones often are. To make your case, build it and demonstrate its value. That is the heart of the scientific revolution.